The goal of this project is to create a model that can predict where in the draft an NFL prospect will be selected. I will combine multiple data sets to yield better results; the data points I'm focusing on are a player's position, the school they attended, and their combine stats. This is a multiclass classification problem that will use classification models, such as multinomial logistic regression, to make its predictions.
The NFL Draft is an opportunity for NFL teams to select players. Each team picks in an order based on how it finished the previous season: the team with the worst record picks first, the second-worst record picks second, and so on. This continues until all 32 NFL teams have picked a player, and then the process repeats in the same order. Each group of 32 picks is referred to as a "round," and there are 7 total rounds in the draft.
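The round/pick arithmetic above can be sketched in a couple of lines of R, assuming an idealized draft where every round has exactly 32 picks (in reality, compensatory picks push later rounds past 32):

```r
# Map an overall pick number to its round and its pick within that round,
# assuming exactly 32 picks per round (an idealization; compensatory
# picks are ignored)
round_of <- function(overall) ((overall - 1) %/% 32) + 1
pick_in_round <- function(overall) ((overall - 1) %% 32) + 1

round_of(c(1, 32, 33))       # 1 1 2
pick_in_round(c(1, 32, 33))  # 1 32 1
```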
This model can be useful both for the teams drafting players and for the prospects entering the draft. For teams, it could help evaluate players, determine where they should be drafted, and, when compared against mock drafts, flag whether a player is under- or overvalued. For prospects, it can offer perspective on whether to enter the draft before their senior year of college: if the model predicts they'll be drafted, or projects them high, it may be a good time to declare; but if it predicts they won't be drafted, or will be drafted too low, playing another year of college to develop may be the better choice.
To find appropriate data for my project, I searched the Kaggle database and found two data sets that will be useful.
nfl_draft_prospects

Source: nfl_draft_prospects
Author: Jack Lichtenstein, Published: May 5th, 2021
Looking at the raw data there are 24 columns, but some of these can be deleted since they aren't helpful. Obvious ones include: player_id, link, traded, trade_note, team, team_abbr, team_logo_espn, guid, and player_image. All of these are either links that we can't use or are related to the team that drafted the player (which we aren't interested in).
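A minimal sketch of this pruning with dplyr's `select()`, shown on a one-row stand-in tibble since the real nfl_draft_prospects data is loaded elsewhere; the dropped names come straight from the list above:

```r
library(dplyr)

drop_cols <- c("player_id", "link", "traded", "trade_note", "team",
               "team_abbr", "team_logo_espn", "guid", "player_image")

# one-row stand-in for the raw data (illustration only)
demo <- tibble(player_name = "Bubba Smith", position = "Defensive End",
               player_id = 23590, link = "http://...", traded = FALSE,
               trade_note = NA, team = "Baltimore Colts", team_abbr = "IND",
               team_logo_espn = "http://...", guid = NA, player_image = NA)

# drop the unhelpful columns; any_of() tolerates names that are absent
demo_trimmed <- demo %>% select(-any_of(drop_cols))
names(demo_trimmed)  # "player_name" "position"
```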
ndf_head <- nfl_draft_prospects %>% head()
ndf_head %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:ncol(ndf_head), extra_css = "white-space: nowrap;") %>%
row_spec(0, align = "c") %>%
scroll_box(width = "100%")
| draft_year | player_id | player_name | position | pos_abbr | school | school_name | school_abbr | link | pick | overall | round | traded | trade_note | team | team_abbr | team_logo_espn | guid | weight | height | pos_rk | ovr_rk | grade | player_image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1967 | 23590 | Bubba Smith | Defensive End | DE | Michigan State | Spartans | MSU | http://insider.espn.com/nfl/draft/player/_/id/23590 | 1 | 1 | 1 | FALSE | from New Orleans | Baltimore Colts | IND | https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/ind.png | NA | NA | NA | NA | NA | NA | NA |
| 1967 | 23591 | Clinton Jones | Running Back | RB | Michigan State | Spartans | MSU | http://insider.espn.com/nfl/draft/player/_/id/23591 | 2 | 2 | 1 | FALSE | from N.Y. Giants | Minnesota Vikings | MIN | https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/min.png | NA | NA | NA | NA | NA | NA | NA |
| 1967 | 23592 | Steve Spurrier | Quarterback | QB | Florida | Gators | FLA | http://insider.espn.com/nfl/draft/player/_/id/23592 | 3 | 3 | 1 | FALSE | from Atlanta | San Francisco 49ers | SF | https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/sf.png | NA | NA | NA | NA | NA | NA | NA |
| 1967 | 23593 | Bob Griese | Quarterback | QB | Purdue | Boilermakers | PUR | http://insider.espn.com/nfl/draft/player/_/id/23593 | 4 | 4 | 1 | FALSE | NA | Miami Dolphins | MIA | https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/mia.png | NA | NA | NA | NA | NA | NA | NA |
| 1967 | 23594 | George Webster | Linebacker | LB | Michigan State | Spartans | MSU | http://insider.espn.com/nfl/draft/player/_/id/23594 | 5 | 5 | 1 | FALSE | NA | Houston Oilers | TEN | https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/ten.png | NA | NA | NA | NA | NA | NA | NA |
| 1967 | 23595 | Floyd Little | Running Back | RB | Syracuse | Orange | SYR | http://insider.espn.com/nfl/draft/player/_/id/23595 | 6 | 6 | 1 | FALSE | NA | Denver Broncos | DEN | https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/den.png | NA | NA | NA | NA | NA | NA | NA |
# Only keeping players drafted in the 2000 draft and after
prospect_2000 <- subset(nfl_draft_prospects, draft_year >= 2000)
# Visualizing the missing data
prospect_final1 <- prospect_2000
prospect_final1 %>%
vis_miss()
Other columns and rows I removed:

- weight and height: these come from the combine stats used later
- pos_rk and ovr_rk: directly correlated with draft position, so they won't help when trying to predict draft position from other predictors
- years 1967-1999: these years aren't covered by the combine data set I found, and they were missing a lot of data in important columns

Columns I added:

- Division and Conference: will show whether the school a player went to has a correlation with where they are drafted
- Drafted: yes or no for whether a player was drafted
pf_head <- prospect_final1 %>% head()
pf_head %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:ncol(pf_head), extra_css = "white-space: nowrap;") %>%
row_spec(0, align = "c") %>%
scroll_box(width = "100%")
| | draft_year | Player_Name | position | Pos | school | school_abbr | Division | Conference | overall | round | grade | Drafted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6180 | 2000 | Courtney Brown | Defensive End | DL | Penn State | PSU | FBS | Big Ten | 1 | 1 | NA | Yes |
| 6181 | 2000 | Lavar Arrington | Linebacker | LB | Penn State | PSU | FBS | Big Ten | 2 | 1 | NA | Yes |
| 6182 | 2000 | Chris Samuels | Offensive Tackle | OL | Alabama | ALA | FBS | SEC | 3 | 1 | NA | Yes |
| 6183 | 2000 | Peter Warrick | Wide Receiver | WR | Florida State | FSU | FBS | ACC | 4 | 1 | NA | Yes |
| 6184 | 2000 | Jamal Lewis | Running Back | RB | Tennessee | TENN | FBS | SEC | 5 | 1 | NA | Yes |
| 6185 | 2000 | Corey Simon | Defensive Tackle | DL | Florida State | FSU | FBS | ACC | 6 | 1 | NA | Yes |
combine_results

Source: combine_results
Author: Mitchell Weg, Published: June 8th, 2023
All 11 of these columns will be useful: they describe a player's physical attributes or give us information for merging with the nfl_draft_prospects data set. I still need to convert the heights into inches and standardize the positions and names so they can be merged with the other data set.
# Putting all of the combine results into one data set
combine_results_all <- bind_rows(combine_results_2000, combine_results_2001, combine_results_2002, combine_results_2003, combine_results_2004, combine_results_2005, combine_results_2006, combine_results_2007, combine_results_2008, combine_results_2009, combine_results_2010, combine_results_2011, combine_results_2012, combine_results_2013, combine_results_2014, combine_results_2015, combine_results_2016, combine_results_2017, combine_results_2018, combine_results_2019, combine_results_2020, combine_results_2021)
# Making a copy
combine_results_all1 <- combine_results_all
cr_head <- combine_results_all1 %>% head()
cr_head %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:ncol(cr_head), extra_css = "white-space: nowrap;") %>%
row_spec(0, align = "c") %>%
scroll_box(width = "100%")
| Player | Pos | School | Ht | Wt | X40yd | Vertical | Bench | Broad.Jump | X3Cone | Shuttle |
|---|---|---|---|---|---|---|---|---|---|---|
| John Abraham | OLB | South Carolina | 6-4 | 252 | 4.55 | NA | NA | NA | NA | NA |
| Shaun Alexander | RB | Alabama | 6-0 | 218 | 4.58 | NA | NA | NA | NA | NA |
| Darnell Alford | OT | Boston Col. | 6-4 | 334 | 5.56 | 25 | 23 | 94 | 8.48 | 4.98 |
| Kyle Allamon | TE | Texas Tech | 6-2 | 253 | 4.97 | 29 | NA | 104 | 7.29 | 4.49 |
| Rashard Anderson | CB | Jackson State | 6-2 | 206 | 4.55 | 34 | NA | 123 | 7.18 | 4.15 |
| Jake Arians | K | Ala-Birmingham | 5-10 | 202 | NA | NA | NA | NA | NA | NA |
combine_results_all1 %>%
vis_miss()
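The cleanup that produces combine_results_all2 isn't shown above; here is a hedged sketch of its two main steps: converting the "6-4"-style Ht strings to total inches, and collapsing positions into the broader groups the merged data uses (e.g. OLB to LB, OT to OL, CB to DB, as the before/after tables suggest). The position mapping is an assumption reconstructed from those tables, not the full list.

```r
# Convert "feet-inches" strings such as "6-4" to total inches
height_to_inches <- function(ht) {
  parts <- strsplit(ht, "-", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 12L + as.integer(p[2]),
         integer(1))
}

# Collapse fine-grained combine positions into broader groups
# (partial mapping inferred from the tables; an assumption)
pos_map <- c(OLB = "LB", ILB = "LB", OT = "OL", OG = "OL",
             CB = "DB", S = "DB", DE = "DL", DT = "DL")
collapse_pos <- function(pos) {
  unname(ifelse(pos %in% names(pos_map), pos_map[pos], pos))
}

height_to_inches(c("6-4", "5-10"))  # 76 70
collapse_pos(c("OLB", "RB"))        # "LB" "RB"
```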
cra_head <- combine_results_all2 %>% head()
cra_head %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:ncol(cra_head), extra_css = "white-space: nowrap;") %>%
row_spec(0, align = "c") %>%
scroll_box(width = "100%")
| Player_Name | Pos | School | weight | X40yd | Vertical | Bench | Broad.Jump | X3Cone | Shuttle | height |
|---|---|---|---|---|---|---|---|---|---|---|
| John Abraham | LB | South Carolina | 252 | 4.55 | NA | NA | NA | NA | NA | 76 |
| Shaun Alexander | RB | Alabama | 218 | 4.58 | NA | NA | NA | NA | NA | 72 |
| Darnell Alford | OL | Boston Col. | 334 | 5.56 | 25 | 23 | 94 | 8.48 | 4.98 | 76 |
| Kyle Allamon | TE | Texas Tech | 253 | 4.97 | 29 | NA | 104 | 7.29 | 4.49 | 74 |
| Rashard Anderson | DB | Jackson State | 206 | 4.55 | 34 | NA | 123 | 7.18 | 4.15 | 74 |
| Jake Arians | K | Ala-Birmingham | 202 | NA | NA | NA | NA | NA | NA | 70 |
pcm_inorder

P.C.M.: Prospect Combine Merge
This is the merge of both data sets into one, giving us a comprehensive list of players drafted between 2000 and 2021, along with their combine stats.
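The merge itself isn't shown; a minimal sketch using dplyr's `inner_join()` on tiny stand-in tibbles (the real join likely also keys on draft year or position to avoid name collisions; the single-key join here is an assumption):

```r
library(dplyr)

# stand-ins for the two cleaned data sets (illustration only)
prospects <- tibble(Player_Name = c("Courtney Brown", "Lavar Arrington"),
                    overall = c(1, 2), round = c(1, 1), draft_year = 2000)
combine   <- tibble(Player_Name = c("Lavar Arrington", "Courtney Brown"),
                    X40yd = c(4.53, 4.78))

# keep only players present in both tables, ordered by draft position
pcm_demo <- prospects %>%
  inner_join(combine, by = "Player_Name") %>%
  arrange(draft_year, overall)
```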
pcm_head <- pcm_inorder %>% head()
pcm_head %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:ncol(pcm_head), extra_css = "white-space: nowrap;") %>%
row_spec(0, align = "c") %>%
scroll_box(width = "100%")
| Player_Name | overall | round | Drafted | draft_year | grade | Pos | position | height | weight | X40yd | Vertical | Bench | Broad.Jump | X3Cone | Shuttle | school_abbr | school | Division | Conference | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1156 | Courtney Brown | 1 | 1 | Yes | 2000 | NA | DL | Defensive End | 77 | 269 | 4.78 | NA | NA | NA | NA | NA | PSU | Penn State | FBS | Big Ten |
| 3382 | Lavar Arrington | 2 | 1 | Yes | 2000 | NA | LB | Linebacker | 75 | 250 | 4.53 | NA | NA | NA | NA | NA | PSU | Penn State | FBS | Big Ten |
| 979 | Chris Samuels | 3 | 1 | Yes | 2000 | NA | OL | Offensive Tackle | 77 | 325 | 5.08 | NA | NA | NA | NA | NA | ALA | Alabama | FBS | SEC |
| 4127 | Peter Warrick | 4 | 1 | Yes | 2000 | NA | WR | Wide Receiver | 71 | 194 | 4.58 | NA | NA | NA | NA | NA | FSU | Florida State | FBS | ACC |
| 2284 | Jamal Lewis | 5 | 1 | Yes | 2000 | NA | RB | Running Back | 72 | 240 | 4.58 | NA | 23 | NA | NA | NA | TENN | Tennessee | FBS | SEC |
| 1127 | Corey Simon | 6 | 1 | Yes | 2000 | NA | DL | Defensive Tackle | 74 | 297 | 4.83 | NA | NA | NA | NA | NA | FSU | Florida State | FBS | ACC |
- Player_Name (chr): Name of the player
- overall (int): The overall pick at which a player was drafted
- round (factor): The round a player was drafted in
- Drafted (factor): Whether or not a player was drafted
- draft_year (int): The year a player was drafted
- grade (int): ESPN's evaluation of the player, 100 best and 0 worst
- Pos (factor): Abbreviation of a player's position
- position (chr): Full name of a player's position
- height (int): Player's height (inches)
- weight (int): Player's weight (lbs)
- X40yd (int): Player's 40 yard dash time
- Vertical (int): Player's vertical leap
- Bench (int): Player's bench press reps of 225 lbs
- Broad.Jump (int): Player's broad jump
- X3Cone (int): Player's time in the 3 cone drill
- Shuttle (int): Player's time in the shuttle run
- school_abbr (chr): Abbreviation of a player's college
- school (chr): Full name of a player's college
- Division (factor): Whether a player played in the FBS or FCS division
- Conference (factor): Which conference the player's college was in, FCS or a conference in FBS

Looking at the data ordered by year, it's clear that from 2000 to 2003 ESPN simply didn't give prospects a grade. I think the best course of action is to delete these years from the data. It shouldn't create bias since these years are independent of the other draft years.
pcm_inorder %>% vis_miss()
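The year filter described above can be a one-liner in base R; it's shown here on a tiny stand-in data frame (the real filter would run on the merged pcm_inorder):

```r
# stand-in rows: one pre-2004 draft (no ESPN grade) and one 2004+ player
toy <- data.frame(Player_Name = c("Chris Samuels", "Alan Reuber"),
                  draft_year  = c(2000, 2004),
                  grade       = c(NA, 67))

# drop the 2000-2003 drafts, which ESPN never graded
toy_graded <- subset(toy, draft_year >= 2004)
nrow(toy_graded)  # 1
```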
Looking at the missingness by position, the only blanks that aren't at random are for quarterbacks, punters, and kickers. Quarterbacks don't record the bench press at the combine, while kickers and punters only record 40 yard times. The second gap in bench is due to wide receivers not recording the bench press from 2004 to 2006. For the quarterbacks, punters, and kickers I'm imputing 0s for the missing values; I'll use bagged trees to fill in the rest of the missing data.
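A hedged sketch of that two-step plan: zero-fill the structurally missing drills for QBs, kickers, and punters with dplyr, then hand the genuinely missing values to recipes' `step_impute_bag()` (bagged trees). Column names follow the merged data set; the toy rows are stand-ins.

```r
library(dplyr)
library(recipes)

# stand-in rows (real data: the merged prospect/combine table)
toy <- tibble(round = factor(c("1", "4", "7")),
              Pos   = c("QB", "K", "RB"),
              X40yd = c(4.8, 4.9, 4.5),
              Bench = c(NA_real_, NA_real_, NA_real_))

# Step 1: QBs, Ks, and Ps skip these drills by convention -> fill with 0
toy_zeroed <- toy %>%
  mutate(across(c(X40yd, Bench),
                ~ if_else(Pos %in% c("QB", "K", "P") & is.na(.x), 0, .x)))

# Step 2: remaining missing values get bagged-tree imputation
impute_rec <- recipe(round ~ ., data = toy_zeroed) %>%
  step_impute_bag(all_numeric_predictors())
```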
pos_order %>% vis_miss()
pcmf_head <- pcm_full %>% head()
pcmf_head %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:ncol(pcmf_head), extra_css = "white-space: nowrap;") %>%
row_spec(0, align = "c") %>%
scroll_box(width = "100%")
| Player_Name | overall | round | Drafted | draft_year | grade | Pos | position | height | weight | X40yd | Vertical | Bench | Broad.Jump | X3Cone | school_abbr | school | Division | Conference | Shuttle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alan Reuber | 0 | Undrafted | No | 2004 | 67 | OL | Offensive Guard | 78 | 323 | 5.49 | 29.0 | 26 | 98 | 7.95 | TA&M | Texas A&M | FBS | SEC | 4.91 |
| Andrae Thurman | 0 | Undrafted | No | 2004 | 48 | WR | Wide Receiver | 71 | 192 | 4.54 | 34.5 | 15 | 121 | 7.31 | SOU | Southern Oregon | FCS | FCS | 4.30 |
| Andrew Shull | 0 | Undrafted | No | 2004 | 59 | DL | Defensive End | 77 | 265 | 4.90 | 30.5 | 16 | 107 | 7.46 | KSU | Kansas State | FBS | Big 12 | 4.28 |
| Anthony Herrera | 0 | Undrafted | No | 2004 | 60 | OL | Offensive Guard | 74 | 315 | 5.20 | 28.5 | 26 | 104 | 7.76 | TENN | Tennessee | FBS | SEC | 4.71 |
| Antonio Hall | 0 | Undrafted | No | 2004 | 43 | OL | Offensive Tackle | 75 | 317 | 5.54 | 26.5 | 27 | 101 | 8.12 | UK | Kentucky | FBS | SEC | 4.55 |
| Arnold Parker | 0 | Undrafted | No | 2004 | 58 | DB | Safety | 74 | 213 | 4.54 | 35.5 | 18 | 120 | 6.98 | NA | NA | FCS | FCS | 4.12 |
pcm_full %>% vis_miss()
We have no more missingness!
We can now learn more about our data through visualization… Graphs!
Defensive Back is the highest because DB is really two positions in one: cornerback and safety. Together they make up a large share of the defense, so it makes sense that they are drafted the most. Something I didn't expect was for Wide Receiver to be higher than Running Back; I would have predicted RBs to be higher because on average they have the shortest careers, approximately 2.57 years. But WRs are used more on the field on average, usually 3 WRs and 1 RB on any given play, so it makes sense they are drafted more often.
pcm_drafted <- subset(pcm_full, Drafted == "Yes")
ggplot(pcm_drafted, aes(x = fct_infreq(Pos))) +
geom_bar(fill = 'navy') +
labs(title = "Prospects Drafted by Position", x = "Position", y = "Players Drafted") +
theme_minimal()
What I notice in this chart is that offensive linemen, defensive linemen, and quarterbacks are picked more often in the first round. O-linemen are picked the 5th most in the draft overall but the 3rd most in the first round, while QBs are picked the 9th most overall but the 6th most in the first round. Also, D-linemen are picked the most in the first round and the second most with the first overall pick. This tells me that DL, OL, and QB are the most valuable positions in football, since they are picked before everyone else.
pcm_full_r1 <- subset(pcm_full, round == "1")
pcm_full_r11 <- subset(pcm_full, round == "1" & overall == "1")
pos_r1 <- ggplot(pcm_full_r1, aes(x = fct_infreq(Pos))) +
geom_bar(fill = 'navy') +
labs(title = "Prospects Drafted in the First Round", x = "Position", y = "Prospects") +
theme_minimal()
pos_r11 <- ggplot(pcm_full_r11, aes(x = fct_infreq(Pos))) +
geom_bar(fill = 'navy') +
labs(title = "Prospects Drafted First Overall", x = "Position", y = "Prospects") +
theme_minimal()
grid.arrange(pos_r1, pos_r11, ncol = 2)
QBs are most often picked in the first round, with a sharp drop in the second. RBs are picked the most in the 4th round and take a steep drop in round five, likely because round five is when most kickers are selected.
pcm_QB <- subset(pcm_full, Pos == "QB" & Drafted == "Yes")
QB_pct <- as.data.frame(prop.table(table(pcm_QB$round)) * 100)
pcm_RB <- subset(pcm_full, Pos == "RB" & Drafted == "Yes")
RB_pct <- as.data.frame(prop.table(table(pcm_RB$round)) * 100)
pcm_K <- subset(pcm_full, Pos == "K" & Drafted == "Yes")
K_pct <- as.data.frame(prop.table(table(pcm_K$round)) * 100)
QB_chart <- ggplot(QB_pct, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill = 'navy') +
ylim(c(0,40)) +
labs(title = "Quarter Backs", x = "Rounds", y = "Percentage") +
theme_minimal()
RB_chart <- ggplot(RB_pct, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill = 'navy') +
ylim(c(0,40)) +
labs(title = "Running Backs", x = "Rounds", y = "Percentage") +
theme_minimal()
K_chart <- ggplot(K_pct, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill = 'navy') +
ylim(c(0,40)) +
labs(title = "Kickers", x = "Rounds", y = "Percentage") +
theme_minimal()
grid.arrange(K_chart, RB_chart, QB_chart, ncol = 3, top = "Percentage Picked by Round")
The SEC is by far the most popular conference for NFL teams to draft from; its schools include Alabama, Georgia, LSU, Tennessee, Texas A&M, and other elite programs.
ggplot(pcm_drafted, aes(x = fct_infreq(Conference))) +
geom_bar(fill = 'navy') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Prospects Drafted by Conference", x = "Conference", y = "Players Drafted")
None of the top five conferences change positions, but this graph shows the tremendous gap between the five power conferences and everyone else.
ggplot(pcm_full_r1, aes(x = fct_infreq(Conference))) +
geom_bar(fill = 'navy') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Prospects Drafted by Conference in the First Round", x = "Conference", y = "Players Drafted")
Linemen don't seem to be getting faster every year, but it's obvious that DL are faster than OL. There also appears to be a standard for DL to run around a 5 second 40 yard dash, since every year they hover around that time.
# Note: this subset holds offensive and defensive linemen
pcm_OL_DL <- subset(pcm_full, Pos == "OL" | Pos == "DL",
select = c(draft_year, Pos, X40yd))
pcm_OL_DL <- aggregate(X40yd ~ Pos + draft_year, data = pcm_OL_DL, FUN = mean)
ggplot(pcm_OL_DL, aes(fill = Pos, x = draft_year, y = X40yd)) +
geom_bar(position = "dodge", stat = "identity") +
ylim(c(0,6)) +
labs(x = 'Draft Year', y = '40 Yard Time') +
theme_minimal()
NFL prospects have gotten significantly faster since 2015, but oddly, bench press numbers have also dropped significantly since then. This may be due to stricter rules for the bench press and looser rules for the 40 yard dash introduced in 2016. Or perhaps more lighter-weight positions like DB and WR were invited to the combine over heavier players like offensive and defensive linemen, bringing down the average bench while improving the 40 times.
pcm_40_avg <- aggregate(X40yd ~ draft_year, pcm_full, mean)
# Getting rid of QBs Punters and Kickers from bench since they are all zero
pcm_Bench_avg <- subset(pcm_full, !(Pos %in% c("QB", "K", "P")))
pcm_Bench_avg <- aggregate(Bench ~ draft_year, pcm_Bench_avg, mean)
x40_avg_line <- ggplot(pcm_40_avg, aes(x = draft_year, y = X40yd)) +
geom_line() +
labs(y = "40yd Time", x = "Year", title = "40yd Time") +
theme_minimal()
Bench_avg_line <- ggplot(pcm_Bench_avg, aes(x = draft_year, y = Bench)) +
geom_line() +
labs(x = "Year", y = "Bench Reps", title = "Bench Press") +
theme_minimal()
grid.arrange(x40_avg_line, Bench_avg_line, ncol = 2, top = "Average Results per Year")
Finally, we can start creating our models.
A model that tries to predict 8 classes (rounds 1 through 7 plus undrafted) wouldn't be effective; that's too many classes to separate reliably, so I'm instead going to collapse the rounds into 4 classes.
pcm_full$round <- gsub("\\b(1|2)\\b", "1st or 2nd", pcm_full$round, ignore.case = TRUE)
pcm_full$round <- gsub("\\b(3|4)\\b", "3rd or 4th", pcm_full$round, ignore.case = TRUE)
pcm_full$round <- gsub("\\b(5|6)\\b", "5th or 6th", pcm_full$round, ignore.case = TRUE)
pcm_full$round <- gsub("\\b(7|Undrafted)\\b", "7th or UD", pcm_full$round, ignore.case = TRUE)
- 1st or 2nd: First or Second Round Draft Pick (High)
- 3rd or 4th: Third or Fourth Round Draft Pick (Middle High)
- 5th or 6th: Fifth or Sixth Round Draft Pick (Middle Low)
- 7th or UD: Seventh Round or Undrafted (Low)
# Cleaning the variable names
pcm_full_c <- clean_names(pcm_full)
# Changing necessary character variables to factors
pcm_full_c$round <- as.factor(pcm_full_c$round)
pcm_full_c$drafted <- as.factor(pcm_full_c$drafted)
pcm_full_c$pos <- as.factor(pcm_full_c$pos)
pcm_full_c$conference <- as.factor(pcm_full_c$conference)
# Changing all dbl to int
pcm_full_c <- pcm_full_c %>% mutate_if(is.double, as.integer)
set.seed(1936)
# Splitting the data by 80% and stratifying by round
nfl_split <- initial_split(pcm_full_c, prop = 0.80, strata = round)
nfl_train <- training(nfl_split)
nfl_test <- testing(nfl_split)
dim(nfl_train)
## [1] 3703 20
dim(nfl_test)
## [1] 927 20
3,703 observations for the training data and 927 observations for the testing data
Is the data imbalanced enough where stratified sampling for cross-validation is necessary?
ggplot(pcm_full_c, aes(x = fct_infreq(round))) +
geom_bar(fill = 'navy') +
labs(x = 'Classes', y = "Players Drafted") +
theme_minimal()
Looking at the graph, there is a significant enough imbalance between 7th or UD and the other classes that stratified sampling is warranted.
nfl_fold <- vfold_cv(nfl_train, v = 5, strata = round)
For my predictors I'm going to use: grade, pos, height, weight, x40yd, vertical, bench, broad_jump, x3cone, shuttle, and conference.
nfl_recipe <- recipe(round ~ grade + pos + height + weight + x40yd + vertical + bench +
broad_jump + x3cone + shuttle + conference, data = nfl_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
prep(nfl_recipe) %>%
bake(new_data = nfl_train) %>%
head() %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:32, extra_css = "white-space: nowrap;") %>%
row_spec(0, align = "c") %>%
scroll_box(width = "100%")
| grade | height | weight | x40yd | vertical | bench | broad_jump | x3cone | shuttle | round | pos_DL | pos_FB | pos_K | pos_LB | pos_LS | pos_OL | pos_P | pos_QB | pos_RB | pos_TE | pos_WR | conference_American | conference_Big.12 | conference_Big.Ten | conference_Conference.USA | conference_FBS.Independents | conference_FCS | conference_Mid.American | conference_Mountain.West | conference_Pac.12 | conference_SEC | conference_Sun.Belt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.875020 | 1.1804448 | -0.4748162 | -0.5261406 | -0.0390914 | -2.4138263 | 0.1481131 | 0.2786202 | 0.1382879 | 1st or 2nd | -0.4366048 | -0.0918694 | -0.0903632 | -0.370941 | -0.0435135 | -0.421161 | -0.1004491 | 4.1735094 | -0.3038181 | -0.2494638 | -0.4054886 | -0.1665748 | -0.3528072 | -0.4176002 | -0.1193266 | -0.1691254 | -0.3163729 | -0.1407973 | -0.211784 | -0.3877069 | 1.9562380 | -0.1304749 |
| 1.927180 | 1.9259995 | 1.7791165 | -0.5261406 | -0.3796929 | 0.6349141 | -0.4734706 | 0.2786202 | 0.1382879 | 1st or 2nd | -0.4366048 | -0.0918694 | -0.0903632 | -0.370941 | -0.0435135 | 2.373748 | -0.1004491 | -0.2395418 | -0.3038181 | -0.2494638 | -0.4054886 | -0.1665748 | -0.3528072 | 2.3939881 | -0.1193266 | -0.1691254 | -0.3163729 | -0.1407973 | -0.211784 | -0.3877069 | -0.5110472 | -0.1304749 |
| 1.927180 | 0.4348901 | -0.3864267 | -0.5261406 | 0.4718107 | -0.1272710 | 0.4871587 | 0.2786202 | 0.1382879 | 1st or 2nd | -0.4366048 | -0.0918694 | -0.0903632 | -0.370941 | -0.0435135 | -0.421161 | -0.1004491 | -0.2395418 | -0.3038181 | -0.2494638 | 2.4654947 | -0.1665748 | -0.3528072 | -0.4176002 | -0.1193266 | -0.1691254 | -0.3163729 | -0.1407973 | -0.211784 | -0.3877069 | -0.5110472 | -0.1304749 |
| 1.718537 | 1.1804448 | -0.2980371 | 1.8941063 | -0.0390914 | -2.4138263 | 0.1481131 | 0.2786202 | 0.1382879 | 1st or 2nd | -0.4366048 | -0.0918694 | -0.0903632 | -0.370941 | -0.0435135 | -0.421161 | -0.1004491 | 4.1735094 | -0.3038181 | -0.2494638 | -0.4054886 | -0.1665748 | -0.3528072 | -0.4176002 | -0.1193266 | -0.1691254 | -0.3163729 | -0.1407973 | -0.211784 | -0.3877069 | -0.5110472 | -0.1304749 |
| 1.927180 | 0.0621128 | -0.2759398 | -0.5261406 | 0.4718107 | -0.1272710 | 0.4871587 | 0.2786202 | 0.1382879 | 1st or 2nd | -0.4366048 | -0.0918694 | -0.0903632 | -0.370941 | -0.0435135 | -0.421161 | -0.1004491 | -0.2395418 | -0.3038181 | -0.2494638 | -0.4054886 | -0.1665748 | -0.3528072 | -0.4176002 | -0.1193266 | -0.1691254 | -0.3163729 | -0.1407973 | -0.211784 | -0.3877069 | -0.5110472 | -0.1304749 |
| 1.875020 | 0.0621128 | -0.6736926 | -0.5261406 | 0.4718107 | -0.3813327 | 0.4871587 | 0.2786202 | 0.1382879 | 1st or 2nd | -0.4366048 | -0.0918694 | -0.0903632 | -0.370941 | -0.0435135 | -0.421161 | -0.1004491 | -0.2395418 | -0.3038181 | -0.2494638 | 2.4654947 | -0.1665748 | 2.8336440 | -0.4176002 | -0.1193266 | -0.1691254 | -0.3163729 | -0.1407973 | -0.211784 | -0.3877069 | -0.5110472 | -0.1304749 |
I have chosen five models that I believe will fit my data best: Elastic Net Regression, Gradient Boosted Trees, K-Nearest Neighbors, Linear Discriminant Analysis (LDA), and Random Forest. I will train each model on the training data, tuning parameters where necessary, and measure each model's performance by its roc_auc value. Finally, I will fit the best performing model to the testing data.
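As a concrete example of the tuning setup, here is a hedged sketch for one of the five models, the elastic net (multinomial regression via the glmnet engine); the grid ranges shown are assumptions, not the values actually used.

```r
library(tidymodels)

# Elastic net for a multiclass outcome: tune both penalty (regularization
# strength) and mixture (0 = pure ridge, 1 = pure lasso)
net_spec <- multinom_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# 10 x 10 regular grid over the two parameters (ranges are assumptions;
# penalty() is specified on the log10 scale)
net_grid <- grid_regular(penalty(range = c(-5, 0)),
                         mixture(range = c(0, 1)),
                         levels = 10)
```

In the project this spec would be wrapped in a `workflow()` with `nfl_recipe` and passed to `tune_grid()` along with the `nfl_fold` resamples to produce `net_tune_nfl`.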
save(net_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/net tune.rda")
save(knn_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/knn tune.rda")
save(rf_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/rf tune 2.rda")
save(bt_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/bt tune.rda")
# Loading in saved tuning results
load("/Users/brend/OneDrive/Documents/131 Project/net tune.rda")
load("/Users/brend/OneDrive/Documents/131 Project/knn tune.rda")
load("/Users/brend/OneDrive/Documents/131 Project/rf tune 2.rda")
load("/Users/brend/OneDrive/Documents/131 Project/bt tune.rda")
For the elastic net model, the parameters penalty and mixture were tuned. Penalty controls the amount of regularization (and thus the complexity of the model), and mixture controls the balance between ridge regression and lasso regression. Looking at the graph for roc_auc, the model appears to perform best with a small mixture value and a large penalty value.
autoplot(net_tune_nfl, metric = 'roc_auc')
The only parameter that needs to be tuned for KNN is the number of neighbors. The graph shows that as the number of neighbors increases, the model performs better; still, it doesn't perform as well as the other models we tried.
autoplot(knn_tune_nfl, metric = 'roc_auc')
In a random forest there are three parameters that need to be tuned.
mtry: the number of predictors randomly sampled at each split of a tree
trees: the number of trees in the model
min_n: the minimum number of data points required in a node for it to be split further
The graphs tell me the data fits best with a high mtry, a medium number of trees, and a high min_n value.
autoplot(rf_tune_nfl, metric = 'roc_auc')
This model tunes mtry and trees like the random forest, but it also tunes learn_rate, which controls how much weight each tree has on the overall model.
Looking at the performance, a low learn rate, few predictors per split, and few trees seemed to fit the data best.
autoplot(bt_tune_nfl, metric = 'roc_auc')
Even though the LDA model performed the best, I'm more confident in the elastic net model, so that is the model I will use on the testing data.
roc_results %>%
kable() %>%
kable_styling("striped", full_width = T)
| Models | .metric | mean |
|---|---|---|
| LDA | roc_auc | 0.8424373 |
| Elastic Net | roc_auc | 0.8353271 |
| Random Forest | roc_auc | 0.8316450 |
| Boosted Tree | roc_auc | 0.8164595 |
| K-Nearest Neighbors | roc_auc | 0.7585771 |
The best parameters for the Elastic Net model are:
best_net_nfl %>% dplyr::select(.config, penalty, mixture) %>%
kable() %>%
kable_styling(full_width = T)
| .config | penalty | mixture |
|---|---|---|
| Preprocessor1_Model097 | 0.0004642 | 1 |
After selecting the best parameters and applying the final fitted model to the testing data, we get an roc_auc of approximately 0.83.
roc_auc(final_net_nfl_test, truth = round, '.pred_1st or 2nd':'.pred_7th or UD') %>%
kable() %>%
kable_styling(full_width = T)
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | hand_till | 0.8326798 |
This is a decent estimate, and it's in line with what we expected from tuning.
Grade was by far the most important predictor for determining where a player is drafted. This makes intuitive sense: if a player is graded more highly (better), they'll likely get drafted sooner. Broad jump and weight look like the most important physical attributes for determining draft position. Playing kicker, defensive line, or tight end was also an indicator of draft spot. I'm surprised and slightly disappointed that the conference a player played in had little to no effect on the outcome (it took a long time to clean that data).
final_net_nfl %>% extract_fit_parsnip() %>%
vip()
It appears it’s much easier to predict first and second round selections compared to other rounds. This makes sense because there is a massive skill/talent gap between the first two rounds and everyone else in the draft.
roc_curve(final_net_nfl_test, truth = round, '.pred_1st or 2nd':'.pred_7th or UD') %>%
autoplot()
The gap between the first and second rounds and everyone else is apparent: the model rarely predicted a highly rated prospect incorrectly. However, it was especially bad at predicting 5th and 6th round players, mostly categorizing them as 7th round or undrafted. This could be because those categories have too much in common, making them hard to tell apart. Having four classes also makes prediction harder; a model that predicts only two outcomes, like drafted or not drafted, would likely do better.
conf_mat(final_net_nfl_test, truth = round, .pred_class) %>%
autoplot(type = "heatmap")
I've learned that it is very challenging to predict where players will be drafted. Many more factors go into drafting players, including: what position the drafting team is looking to fill, whether the player has off-field issues, how old the player is, whether the player is injury prone, and several other variables. It simply wasn't enough to use a player's combine stats and the conference they played in. My model also relied heavily on ESPN's grading system, which is disappointing; once the grades became lower and less consistent, it was harder for the model to place players. Predicting a multiclass problem was also challenging: the more classes you try to predict, the harder it is for the model to make correct predictions. This model can definitely be improved with better predictors, more data, and fewer classes.